    Changing Views on Curves and Surfaces

    Visual events in computer vision are studied from the perspective of algebraic geometry. Given a sufficiently general curve or surface in 3-space, we consider the image or contour curve that arises by projecting from a viewpoint. Qualitative changes in that curve occur when the viewpoint crosses the visual event surface. We examine the components of this ruled surface, and observe that these coincide with the iterated singular loci of the coisotropic hypersurfaces associated with the original curve or surface. We derive formulas, due to Salmon and Petitjean, for the degrees of these surfaces, and show how to compute exact representations for all visual event surfaces using algebraic methods.
    Comment: 31 pages.
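    To make the projection step concrete, here is a minimal sketch (not the paper's implementation) that projects the twisted cubic from a hypothetical viewpoint onto an image plane and recovers the implicit equation of the image curve by elimination with a resultant; the curve, viewpoint, and image plane are all illustrative choices:

```python
# A minimal sketch: project the twisted cubic from a hypothetical viewpoint
# and recover the implicit image curve by elimination, the kind of exact
# algebraic computation the abstract alludes to.
import sympy as sp

t, x, y = sp.symbols('t x y')

# Twisted cubic in affine 3-space: p(t) = (t, t^2, t^3).
px, py, pz = t, t**2, t**3

# Hypothetical viewpoint c = (0, 0, 5); project onto the image plane z = 0.
cx, cy, cz = 0, 0, 5
s = cz / (cz - pz)               # parameter where the visual ray hits z = 0
xi = cx + s * (px - cx)          # image coordinates as rational functions of t
yi = cy + s * (py - cy)

# Clear denominators and eliminate t with a resultant to get the
# implicit equation of the image curve.
f1 = sp.together(xi - x).as_numer_denom()[0]
f2 = sp.together(yi - y).as_numer_denom()[0]
image_curve = sp.factor(sp.resultant(f1, f2, t))
print(image_curve)
```

    As the viewpoint crosses a visual event surface, the singularities of the computed image curve change qualitatively; that is the phenomenon the paper's degree formulas quantify.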

    Multigraded Cayley-Chow forms

    We introduce a theory of multigraded Cayley-Chow forms associated to subvarieties of products of projective spaces. Two new phenomena arise: first, the construction turns out to require certain inequalities on the dimensions of projections; and second, in positive characteristic the multigraded Cayley-Chow forms can have higher multiplicities. The theory also provides a natural framework for understanding multifocal tensors in computer vision.
    Comment: 20 pages, 1 figure.
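    For orientation, the classical single-graded object being generalized is easy to compute in the zero-dimensional case: the Chow form of a finite point set in P^2 is a product of linear forms in dual line coordinates, vanishing exactly on the lines that meet the set. A minimal sketch with hypothetical points:

```python
# Classical (single-grading) Chow form of a zero-dimensional variety
# X = {p1, p2, p3} in P^2: the product of linear forms <u, p_i> in dual
# coordinates u, which vanishes exactly when the line u meets X.
import sympy as sp

u0, u1, u2 = sp.symbols('u0 u1 u2')
points = [(1, 0, 0), (0, 1, 1), (1, 2, 3)]   # hypothetical points of X in P^2

chow_form = sp.expand(sp.prod(u0*p[0] + u1*p[1] + u2*p[2] for p in points))
print(chow_form)

# Sanity check: a line through p1 = (1, 0, 0), e.g. u = (0, 1, 0),
# must satisfy the Chow form.
assert chow_form.subs({u0: 0, u1: 1, u2: 0}) == 0
```

    In the multigraded setting of the paper, the variety lives in a product of projective spaces and the dual coordinates become multihomogeneous, which is where the dimension inequalities and, in positive characteristic, the higher multiplicities enter.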

    Towards Visual Foundational Models of Physical Scenes

    We describe a first step towards learning general-purpose visual representations of physical scenes using only image prediction as a training criterion. To do so, we first define "physical scene" and show that, even though different agents may maintain different representations of the same scene, the underlying physical scene that can be inferred is unique. Then, we show that NeRFs cannot represent the physical scene, as they lack extrapolation mechanisms. Such mechanisms could, at least in theory, be provided by Diffusion Models. To test this hypothesis empirically, we combine NeRFs with Diffusion Models, a process we refer to as NeRF Diffusion, and use the result as an unsupervised representation of the physical scene. Our analysis is limited to visual data, without external grounding mechanisms that can be provided by independent sensory modalities.
    Comment: TLDR: Physical scenes are equivalence classes of sufficient statistics that can be inferred uniquely by any agent measuring the same finite data; we formalize and implement an approach to representation learning that overturns "naive realism" in favor of the analytical approach of Russell and Koenderink. NeRFs cannot capture physical scenes, but combined with Diffusion Models they can.
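    The interpolation-versus-extrapolation claim can be illustrated with a deliberately simple analogy (this is not the paper's NeRF Diffusion pipeline): a purely data-fitted model reproduces a signal where it was observed but fails beyond, while a model equipped with a prior over a family of signals can extrapolate. Here the "scene" is a hypothetical 1-D signal, the interpolator is a polynomial fit, and the prior is the assumption that signals come from the family a*sin(b*x):

```python
# Toy analogy only: interpolation without a prior vs. extrapolation with one.
import numpy as np

rng = np.random.default_rng(0)
x_obs = np.linspace(0.0, 1.0, 30)           # observed region ("training views")
y_obs = 2.0 * np.sin(3.0 * x_obs) + 0.01 * rng.standard_normal(30)
x_new = np.linspace(1.0, 2.0, 30)           # unobserved region
y_true = 2.0 * np.sin(3.0 * x_new)

# Interpolator with no prior: a high-degree polynomial fit to the data.
poly = np.polynomial.Polynomial.fit(x_obs, y_obs, deg=8)

# Prior-based model: fit (a, b) in the assumed family a*sin(b*x) via a grid
# search over b, with the optimal a in closed form (ordinary least squares).
def best_a(b):
    s = np.sin(b * x_obs)
    return (y_obs @ s) / (s @ s)

bs = np.linspace(0.5, 6.0, 500)
errors = [np.sum((y_obs - best_a(b) * np.sin(b * x_obs)) ** 2) for b in bs]
b_hat = bs[int(np.argmin(errors))]
a_hat = best_a(b_hat)

print("interpolator max error off-support:", np.abs(poly(x_new) - y_true).max())
print("prior-based max error off-support: ",
      np.abs(a_hat * np.sin(b_hat * x_new) - y_true).max())
```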

    Prompt Algebra for Task Composition

    We investigate whether prompts learned independently for different tasks can later be combined through prompt algebra to obtain a model that supports composition of tasks. We consider Visual Language Models (VLMs) with prompt tuning as our base classifier and formally define the notion of prompt algebra. We propose constrained prompt tuning to improve the performance of the composite classifier. In the proposed scheme, prompts are constrained to lie in the lower-dimensional subspace spanned by the basis vectors of the pre-trained vocabulary. Further regularization is added to ensure that the learned prompt remains correctly grounded in the existing pre-trained vocabulary. We demonstrate the effectiveness of our method on object classification and object-attribute classification datasets. On average, our composite model obtains classification accuracy within 2.5% of the best base model. On UTZappos it improves classification accuracy over the best base model by 8.45% on average.
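    A minimal numerical sketch of the subspace constraint and the composition step (the dimensions and interfaces here are invented for illustration; this is not the authors' code):

```python
# Toy sketch: constrain prompts to a vocabulary-spanned subspace, then
# compose two task prompts by vector addition ("prompt algebra").
import numpy as np

rng = np.random.default_rng(0)
d, v = 64, 1000                        # embedding dim, vocabulary size
vocab = rng.standard_normal((v, d))    # stand-in for pre-trained token embeddings

# Orthonormal basis of a low-rank "vocabulary subspace"; the rank is
# truncated here so the constraint is non-trivial in this toy setting.
basis = np.linalg.svd(vocab, full_matrices=False)[2][:32]

def constrain(prompt):
    """Project a prompt onto the span of the vocabulary basis vectors."""
    return basis.T @ (basis @ prompt)

# Stand-ins for prompts tuned independently on two tasks, each constrained.
p_task_a = constrain(rng.standard_normal(d))
p_task_b = constrain(rng.standard_normal(d))

# Prompt algebra: compose by addition; since the constraint is a linear
# projection, the composed prompt stays inside the vocabulary subspace.
p_composed = p_task_a + p_task_b
print(np.allclose(p_composed, constrain(p_composed)))   # True
```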

    Linear Spaces of Meanings: Compositional Structures in Vision-Language Models

    We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary. In contrast, we seek to approximate representations from an encoder as combinations of a smaller set of vectors in the embedding space. These vectors can be seen as "ideal words" for generating concepts directly within the embedding space of the model. We first present a framework for understanding compositional structures from a geometric perspective. We then explain what these compositional structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice. Finally, we empirically explore these structures in CLIP's embeddings and evaluate their usefulness for solving different vision-language tasks such as classification, debiasing, and retrieval. Our results show that simple linear algebraic operations on embedding vectors can be used as compositional and interpretable methods for regulating the behavior of VLMs.
    Comment: 18 pages, 9 figures, 7 tables.
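    A toy illustration of the "ideal words" picture (synthetic vectors, not CLIP embeddings): if compound-concept embeddings decompose approximately as sums of attribute and object vectors, then directions can be recovered and transferred by plain vector arithmetic:

```python
# Toy sketch: embeddings that are (noisily) sums of "ideal word" vectors,
# manipulated with ordinary linear algebra.
import numpy as np

rng = np.random.default_rng(1)
d = 32
# Hypothetical factor ("ideal word") vectors for attributes and objects.
ideal = {w: rng.standard_normal(d) for w in ["red", "blue", "car", "shoe"]}

def embed(attr, obj):
    """Toy encoder whose outputs are approximately compositional."""
    return ideal[attr] + ideal[obj] + 0.05 * rng.standard_normal(d)

# Recover the "red minus blue" direction from embedding differences...
red_dir = embed("red", "car") - embed("blue", "car")
# ...and transfer it to a new object by vector arithmetic.
pred = embed("blue", "shoe") + red_dir
target = embed("red", "shoe")
cos = pred @ target / (np.linalg.norm(pred) * np.linalg.norm(target))
print(f"cosine(predicted red shoe, actual red shoe) = {cos:.3f}")
```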

    Trinocular Geometry Revisited

    When do the visual rays associated with triplets of point correspondences converge, that is, intersect in a common point? Classical models of trinocular geometry based on the fundamental matrices and trifocal tensor associated with the corresponding cameras only provide partial answers to this fundamental question, in large part because of underlying, but seldom explicit, general-configuration assumptions. This paper uses elementary tools from projective line geometry to provide necessary and sufficient geometric and analytical conditions for convergence in terms of transversals to triplets of visual rays, without any such assumptions. In turn, this yields a novel and simple minimal parameterization of trinocular geometry for cameras with non-collinear or collinear pinholes, which can be used to construct a practical and efficient method for trinocular geometry parameter estimation. We present numerical experiments using synthetic and real data.
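    A minimal sketch of the projective-line-geometry ingredient (not the paper's full necessary-and-sufficient criterion): visual rays are encoded in Plücker coordinates, two lines meet exactly when their reciprocal product vanishes, and the vanishing of all three pairwise products is necessary for convergence, though not sufficient, since three coplanar rays also satisfy it. The pinholes and scene point below are hypothetical:

```python
# Plücker-coordinate test for pairwise intersection of three visual rays.
import numpy as np

def plucker(a, b):
    """Plücker coordinates (direction, moment) of the line through a and b."""
    return np.concatenate([b - a, np.cross(a, b)])

def reciprocal(l1, l2):
    """Reciprocal product; zero iff the lines are coplanar (meet or parallel)."""
    return l1[:3] @ l2[3:] + l2[:3] @ l1[3:]

# Hypothetical pinholes and a common scene point observed by all three.
pinholes = [np.array([0.0, 0, 0]), np.array([1.0, 0, 0]), np.array([0.0, 1, 0])]
point = np.array([0.5, 0.5, 2.0])

rays = [plucker(c, point) for c in pinholes]
for i in range(3):
    for j in range(i + 1, 3):
        print(f"rays {i},{j}: reciprocal product = "
              f"{reciprocal(rays[i], rays[j]):.2e}")   # all ~0: rays converge
```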

    The joint image handbook

    Given multiple perspective photographs, point correspondences form the "joint image", effectively a replica of three-dimensional space distributed across its two-dimensional projections. This set can be characterized by multilinear equations over image coordinates, such as epipolar and trifocal constraints. We revisit in this paper the geometric and algebraic properties of the joint image, and address fundamental questions such as how many and which multilinearities are necessary and/or sufficient to determine camera geometry and/or image correspondences. The new theoretical results in this paper answer these questions in a very general setting and, in turn, are intended to serve as a "handbook" reference about multilinearities for practitioners.
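    As a concrete instance of one such multilinearity, here is a sketch of the two-view epipolar constraint x2' F x1 = 0 with hypothetical camera matrices, using the standard construction F = [e2]_x P2 P1^+:

```python
# Build a fundamental matrix from two hypothetical cameras and verify the
# epipolar constraint on a projected scene point.
import numpy as np

def skew(v):
    """Cross-product matrix: skew(v) @ w == np.cross(v, w)."""
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

# Hypothetical cameras: P1 canonical, P2 translated and slightly rotated.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
c = np.deg2rad(5.0)
R = np.array([[np.cos(c), 0, np.sin(c)],
              [0, 1, 0],
              [-np.sin(c), 0, np.cos(c)]])
P2 = np.hstack([R, np.array([[-1.0], [0.0], [0.0]])])

# F = [e2]_x P2 P1^+, where e2 is the image under P2 of P1's center.
center1 = np.array([0.0, 0, 0, 1])           # null vector of P1
e2 = P2 @ center1
F = skew(e2) @ P2 @ np.linalg.pinv(P1)

# Any 3-D point projected into both views satisfies the bilinear constraint.
X = np.array([0.3, -0.2, 4.0, 1.0])
x1, x2 = P1 @ X, P2 @ X
print("epipolar residual:", x2 @ F @ x1)     # ~0 up to floating point
```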